Abstract

We propose IMAGINET, a model of learning visually grounded representations of language from coupled textual and visual input. The model consists of two Gated Recurrent Unit networks with shared word embeddings, and uses a multi-task objective by receiving a textual description of a scene and trying to concurrently predict its visual representation and the next word in the sentence. Mimicking an important aspect of human language learning, it acquires meaning representations for individual words from descriptions of visual scenes. Moreover, it learns to effectively use sequential structure in semantic interpretation of multi-word phrases.

1 Introduction 

Vision is the most important sense for humans, and visual sensory input plays an important role in language acquisition by grounding meanings of words and phrases in perception. Similarly, in practical applications, processing multimodal data where text is accompanied by images or videos is increasingly important. In this paper we propose a novel model of learning visually-grounded representations of language from paired textual and visual input. The model learns language through comprehension and production, by receiving a textual description of a scene and trying to “imagine” a visual representation of it, while predicting the next word at the same time.

The full model, which we dub IMAGINET, consists of two Gated Recurrent Unit (GRU) networks coupled via shared word embeddings. IMAGINET uses a multi-task objective (Caruana, 1997): both networks read the sentence word-by-word in parallel; one of them predicts the feature representation of the image depicting the described scene after reading the whole sentence, while the other one predicts the next word at each position in the word sequence. The importance of the visual and textual objectives can be traded off, and either of them can be switched off entirely, enabling us to investigate the impact of visual vs textual information on the learned language representations.

Our approach to modeling human language learning has connections to recent models of image captioning (see Section 2). Unlike in many of these models, in IMAGINET the image is the target to predict rather than the input, and the model can build a visually-grounded representation of a sentence independently of an image. We can directly compare the performance of IMAGINET against a simple multivariate linear regression model with bag-of-words features and thus quantify the contribution of the added expressive power of a recurrent neural network.

We evaluate our model’s knowledge of word meaning and sentence structure through simulating human judgments of word similarity, retrieving images corresponding to single words as well as full sentences, and retrieving paraphrases of image captions. In all these tasks the model outperforms the baseline; the model significantly correlates with human ratings of word similarity, and predicts appropriate visual interpretations of single and multi-word phrases. The acquired knowledge of sentence structure boosts the model’s performance in both image and caption retrieval.

2 Related work 

Several computational models have been proposed to study early language acquisition. The acquisition of word meaning has been mainly modeled using connectionist networks that learn to associate word forms with semantic or perceptual features (e.g., Li et al., 2004; Coventry et al., 2005; Regier, 2005), and rule-based or probabilistic implementations which use statistical regularities observed in the input to detect associations between linguistic labels and visual features or concepts (e.g., Siskind, 1996; Yu, 2008; Fazly et al., 2010). These models either use toy languages as input (e.g., Siskind, 1996), or child-directed utterances from the CHILDES database (MacWhinney, 2014) paired with artificially generated semantic information. Some models have investigated the acquisition of terminology for visual concepts from simple videos (Fleischman and Roy, 2005; Skocaj et al., 2011). Lazaridou et al. (2015) adapt the skip-gram word-embedding model (Mikolov et al., 2013) for learning word representations via a multi-task objective similar to ours, learning from a dataset where some words are individually aligned with corresponding images. All these models ignore sentence structure and treat inputs as bags of words.

A few models have looked at the concurrent acquisition of words and some aspect of sentence structure, such as lexical categories (Alishahi and Chrupala, 2012) or syntactic properties (Howell et al., 2005; Kwiatkowski et al., 2012), from utterances paired with an artificially generated representation of their meaning. To our knowledge, no existing model has been proposed for concurrent learning of grounded word meanings and sentence structure from large-scale data and realistic visual input.

Recently, the engineering task of generating captions for images has received a lot of attention (Karpathy and Fei-Fei, 2014; Mao et al., 2014; Kiros et al., 2014; Donahue et al., 2014; Vinyals et al., 2014; Venugopalan et al., 2014; Chen and Zitnick, 2014; Fang et al., 2014). From the point of view of modeling, the research most relevant to our interests is that of Chen and Zitnick (2014). They develop a model based on a context-dependent recurrent neural network (Mikolov and Zweig, 2012) which simultaneously processes textual and visual input and updates two parallel hidden states. Unlike theirs, our model receives the visual target only at the end of the sentence and is thus encouraged to store in the final hidden state of the visual pathway all aspects of the sentence needed to predict the image features successfully. Our setup is more suitable for the goal of learning representations of complete sentences.

3 Models 

Figure 1: Structure of IMAGINET.

IMAGINET consists of two parallel recurrent pathways coupled via shared word embeddings. Both pathways are composed of Gated Recurrent Units (GRUs), first introduced by Cho et al. (2014) and Chung et al. (2014). GRUs are related to the Long Short-Term Memory units (Hochreiter and Schmidhuber, 1997), but do not employ a separate memory cell. In a GRU, the activation at time $t$ is a linear combination of the previous activation and the candidate activation:

$$ h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \tag{1} $$

where $\odot$ is elementwise multiplication. The update gate determines how much the activation is updated:

$$ z_t = \sigma_s(W_z x_t + U_z h_{t-1}) \tag{2} $$

The candidate activation is computed as:

$$ \tilde{h}_t = \sigma(W x_t + U(r_t \odot h_{t-1})) \tag{3} $$

The reset gate is defined as:

$$ r_t = \sigma_s(W_r x_t + U_r h_{t-1}) \tag{4} $$

Our gated recurrent units use steep sigmoids for gate activations:

$$ \sigma_s(z) = \frac{1}{1 + \exp(-3.75z)} $$

and rectified linear units clipped between 0 and 5 for the unit activations:

$$ \sigma(z) = \mathrm{clip}(0.5(z + \mathrm{abs}(z)), 0, 5) $$
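The GRU update in Eqs. (1)-(4) can be sketched in a few lines of plain Python. This is a minimal illustration and not the authors' Theano implementation: the helper `matvec`, the function names, and the dictionary keys for the weight matrices are hypothetical, vectors are plain lists, and bias terms are omitted because the equations above do not show any.

```python
import math

def steep_sigmoid(z):
    """Gate activation: logistic sigmoid with slope 3.75."""
    return 1.0 / (1.0 + math.exp(-3.75 * z))

def clipped_relu(z):
    """Unit activation: rectifier clipped to the range [0, 5]."""
    return min(max(0.5 * (z + abs(z)), 0.0), 5.0)

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def gru_step(h_prev, x, p):
    """One GRU update; p maps hypothetical names to weight matrices."""
    # Update gate z_t and reset gate r_t (Eqs. 2 and 4).
    z = [steep_sigmoid(a + b)
         for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], h_prev))]
    r = [steep_sigmoid(a + b)
         for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], h_prev))]
    # Candidate activation (Eq. 3): the reset gate scales the previous state.
    rh = [ri * hi for ri, hi in zip(r, h_prev)]
    h_cand = [clipped_relu(a + b)
              for a, b in zip(matvec(p["W"], x), matvec(p["U"], rh))]
    # Interpolate between previous and candidate activation (Eq. 1).
    return [(1.0 - zi) * hi + zi * ci
            for zi, hi, ci in zip(z, h_prev, h_cand)]
```

With all weights set to zero, both gates evaluate to 0.5 and the candidate activation to 0, so the step simply halves the previous state; this is a convenient sanity check of the interpolation in Eq. (1).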

Figure 1 illustrates the structure of the network. The word embeddings form a matrix of learned parameters $W_e$, with each column corresponding to a vector for a particular word. The input word symbol $S_t$ of sentence $S$ at each step $t$ indexes into the embedding matrix, and the resulting vector $x_t$ forms the input to both GRU networks:

$$ x_t = W_e[:, S_t] \tag{5} $$

This input is mapped into two parallel hidden states, $h^V_t$ along the visual pathway, and $h^T_t$ along the textual pathway:

$$ h^V_t = \mathrm{GRU}^V(h^V_{t-1}, x_t) \tag{6} $$

$$ h^T_t = \mathrm{GRU}^T(h^T_{t-1}, x_t) \tag{7} $$

The final hidden state along the visual pathway, $h^V_n$ for a sentence of $n$ words, is then mapped to the predicted target image representation $\hat{i}$ by the fully connected layer with parameters $V$ and the clipped rectifier activation:

$$ \hat{i} = \sigma(V h^V_n) \tag{8} $$

Each hidden state along the textual pathway is used to predict the next symbol in the sentence $S$ via a softmax layer with parameters $L$:

$$ p(S_{t+1} \mid S_{1:t}) = \mathrm{softmax}(L h^T_t) \tag{9} $$
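The embedding lookup of Eq. (5) and the softmax output of Eq. (9) can be illustrated as follows. This is a sketch under assumptions: matrices are stored as lists of rows, the function names are hypothetical, and the softmax subtracts the maximum score for numerical stability, a standard trick that the text does not mention.

```python
import math

def embed(W_e, word_index):
    """Eq. (5): select the column of W_e that belongs to the input word."""
    return [row[word_index] for row in W_e]

def softmax(scores):
    """Normalize scores into a probability distribution."""
    m = max(scores)                      # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def next_word_distribution(L, h_text):
    """Eq. (9): distribution over the vocabulary given a textual hidden state."""
    scores = [sum(l * h for l, h in zip(row, h_text)) for row in L]
    return softmax(scores)
```

Here `L` has one row per vocabulary item, so the returned list has one probability per word.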

The loss function, whose gradient is backpropagated through time to the GRUs and the embeddings, is a composite objective with terms penalizing error on the visual and the textual targets simultaneously:

$$ L(\theta) = \alpha L^T(\theta) + (1 - \alpha) L^V(\theta) \tag{10} $$

where $\theta$ is the set of all IMAGINET parameters. $L^T$ is the cross entropy function:

$$ L^T(\theta) = -\frac{1}{n} \sum_{t=1}^{n} \log p(S_t \mid S_{1:t}) \tag{11} $$

while $L^V$ is the mean squared error:

$$ L^V(\theta) = \frac{1}{K} \sum_{k=1}^{K} (\hat{i}_k - i_k)^2 \tag{12} $$
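The composite objective of Eq. (10) can be sketched as below. The function names are hypothetical, and the sketch assumes that the cross entropy is averaged over the words of the sentence and the squared error over the image feature dimensions; inputs are the probabilities the model assigned to the observed words and the predicted and target image vectors.

```python
import math

def cross_entropy(probs_of_targets):
    """Mean negative log-probability assigned to the observed words."""
    n = len(probs_of_targets)
    return -sum(math.log(p) for p in probs_of_targets) / n

def mean_squared_error(predicted, target):
    """Mean squared error over the image feature dimensions."""
    k = len(target)
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / k

def composite_loss(alpha, probs_of_targets, predicted, target):
    """alpha weights the textual term, (1 - alpha) the visual term."""
    return (alpha * cross_entropy(probs_of_targets)
            + (1.0 - alpha) * mean_squared_error(predicted, target))
```

Calling `composite_loss` with `alpha=0.0` leaves only the visual term, and `alpha=1.0` leaves only the textual term.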

By setting $\alpha$ to 0 we can switch the whole textual pathway off and obtain the VISUAL model variant. Analogously, setting $\alpha$ to 1 gives the TEXTUAL model. Intermediate values of $\alpha$ (in the experiments below we use 0.1) give the full MULTITASK version. Finally, as a baseline for some of the tasks we use a simple linear regression model LINREG with a bag-of-words representation of the sentence:

$$ \hat{i} = Ax + b \tag{13} $$

where $\hat{i}$ is the vector of the predicted image features, $x$ is the vector of word counts for the input sentence, and $(A, b)$ are the parameters of the linear model, estimated via an L2-penalized sum-of-squared-errors loss.
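The prediction step of the bag-of-words baseline in Eq. (13) can be sketched as follows. The function names are hypothetical, fitting of the parameters (the L2-penalized least-squares step) is not shown, and skipping out-of-vocabulary tokens is an assumption of this sketch rather than something the text specifies.

```python
def bag_of_words(tokens, vocab):
    """Return the word-count vector x for a tokenized sentence."""
    index = {w: j for j, w in enumerate(vocab)}
    x = [0] * len(vocab)
    for tok in tokens:
        if tok in index:          # assumption: unknown tokens are ignored
            x[index[tok]] += 1
    return x

def linreg_predict(A, b, x):
    """Eq. (13): predicted image features A x + b."""
    return [sum(a * xj for a, xj in zip(row, x)) + bk
            for row, bk in zip(A, b)]
```

Because `x` only records counts, any reordering of the tokens yields the same prediction, which is what makes this model a useful contrast to the recurrent pathways.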


             SimLex   MEN 3K
VISUAL        0.32     0.57
MULTITASK     0.39     0.63
TEXTUAL       0.31     0.53
LINREG        0.18     0.23

Table 1: Word similarity correlations with human judgments measured by Spearman’s ρ (all correlations are significant at level p < 0.01).

4 Experiments 

Settings The model was implemented in Theano (Bastien et al., 2012; Bergstra et al., 2010) and optimized by Adam (Kingma and Ba, 2014).^1 The fixed 4096-dimensional target image representations come from the pre-softmax layer of the 16-layer CNN (Simonyan and Zisserman, 2014). We used 1024 dimensions for the embeddings and for the hidden states of each of the GRU networks. We ran 8 iterations of training, and we report either full learning curves, or the results for each model after iteration 7 (where they performed best for the image retrieval task). For training we use the standard MS-COCO training data. For validation and test, we take a sample of 5000 images each from the validation data.

4.1 Word representations 

We assess the quality of the learned embeddings for single words via two tasks: (i) we measure similarity between embeddings of word pairs and compare them to elicited human ratings; (ii) we examine how well the model learns visual representations of words by projecting word embeddings into the visual space, and retrieving images of single concepts from ImageNet.

Word similarity judgment For similarity judgment correlations, we selected two existing benchmarks that have the largest vocabulary overlap with our data: MEN 3K (Bruni et al., 2014) and SimLex-999 (Hill et al., 2014). We measure the similarity between word pairs by computing the cosine similarity between their embeddings from three versions of our model, VISUAL, MULTITASK and TEXTUAL, and the baseline LINREG.
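The evaluation described above can be sketched as cosine similarity between embeddings followed by a rank correlation with the human ratings. The function names and the data layout are hypothetical, and skipping pairs with out-of-vocabulary words is an assumption; Spearman's coefficient is computed as the Pearson correlation of average ranks.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def ranks(values):
    """1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return result

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def spearman(a, b):
    """Spearman's rho: Pearson correlation of the rank-transformed scores."""
    return pearson(ranks(a), ranks(b))

def similarity_correlation(embeddings, rated_pairs):
    """rated_pairs holds (word1, word2, human_rating) triples."""
    model, human = [], []
    for w1, w2, rating in rated_pairs:
        if w1 in embeddings and w2 in embeddings:   # assumption: skip unknown words
            model.append(cosine(embeddings[w1], embeddings[w2]))
            human.append(rating)
    return spearman(model, human)
```

Using a rank correlation means that only the ordering of the model's similarities matters, not their absolute scale.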

Table 1 summarizes the results. All IMAGINET models significantly correlate with human similarity judgments, and outperform LINREG. Examples of word pairs for which MULTITASK captures human similarity judgments better than VISUAL include antonyms (dusk, dawn), collocations (sexy, smile), or related but not visually similar words (college, exhibition).

^1 Code available at github.com/gchrupala/imaginet.

VISUAL   MULTITASK   LINREG
 0.38      0.38       0.33

Table 2: Accuracy@5 of retrieving images with compatible labels from ImageNet.

Single-word image retrieval In order to visualize the acquired meaning for individual words, we use images from the ILSVRC2012 subset of ImageNet (Russakovsky et al., 2014) as a benchmark. Labels of the images in ImageNet are synsets from WordNet, which identify a single concept in the image rather than providing descriptions of its full content. Since the synset labels in ImageNet are much more precise than the descriptions provided in the captions in our training data (e.g., elkhound), we use synset hypernyms from WordNet as substitute labels when the original labels are not in our vocabulary.

We extracted the features from the 50,000 images of the ImageNet validation set. The labels in this set result in 393 distinct (original or hypernym) words from our vocabulary. Each word was projected to the visual space by feeding it through the model as a one-word sentence. We ranked the vectors corresponding to all 50,000 images based on their similarity to the predicted vector, and measured the accuracy of retrieving an image with the correct label among the top 5 ranked images (Accuracy@5). Table 2 summarizes the results: VISUAL and MULTITASK learn more accurate word meaning representations than LINREG.
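The Accuracy@5 measure described above can be sketched as follows. The function names and the data layout are hypothetical, and the use of cosine similarity for ranking is an assumption, since the text only says that images are ranked by similarity to the predicted vector.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def top_k(query, image_vectors, k=5):
    """Indices of the k image vectors most similar to the query vector."""
    ranked = sorted(range(len(image_vectors)),
                    key=lambda i: cosine(query, image_vectors[i]),
                    reverse=True)
    return ranked[:k]

def accuracy_at_k(queries, image_vectors, image_labels, k=5):
    """queries holds (predicted_vector, word) pairs.

    A query counts as a hit if at least one of its top-k images
    carries the query word as its label.
    """
    hits = 0
    for vec, word in queries:
        if any(image_labels[i] == word for i in top_k(vec, image_vectors, k)):
            hits += 1
    return hits / len(queries)
```

In the setting of the text, `image_vectors` would hold the CNN features of the 50,000 validation images and `image_labels` their original or hypernym labels.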

4.2 Sentence structure 

In the following experiments, we examine the knowledge of sentence structure learned by IMAGINET, and its impact on the model performance on image and paraphrase retrieval.

Figure 2: Left: Accuracy@5 of image retrieval with original versus scrambled captions. Right: Recall@4 of paraphrase retrieval with original vs scrambled captions.

Image retrieval We retrieve images based on the similarity of their vectors with those predicted by IMAGINET in two conditions: sentences are fed to the model in their original order, or scrambled. Figure 2 (left) shows the proportion of sentences for which the correct image was in the top 5 highest ranked images for each model, as a function of the number of training iterations: both models (VISUAL and MULTITASK) outperform the baseline. MULTITASK is initially better in retrieving the correct image, but eventually the gap disappears. Both models perform substantially better when tested on the original captions compared to the scrambled ones, indicating that the models learn to exploit aspects of sentence structure. This ability is to be expected for MULTITASK, but the VISUAL model shows a similar effect to some extent. In the case of VISUAL, this sensitivity to structural aspects of sentence meaning is entirely driven by how they are reflected in the image, as this model only receives the visual supervision signal.
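The scrambled condition can be produced with a simple token shuffle. This is a sketch under assumptions: the function name is hypothetical, captions are assumed to be whitespace-tokenized with punctuation as separate tokens, and a seeded generator is used only to make the example reproducible.

```python
import random

def scramble(caption, rng):
    """Return the caption with its tokens in random order."""
    tokens = caption.split()
    rng.shuffle(tokens)          # in-place shuffle using the supplied generator
    return " ".join(tokens)

rng = random.Random(0)           # fixed seed, chosen arbitrarily for this sketch
scrambled = scramble("a cute baby playing with a cell phone", rng)
```

The scrambled caption contains exactly the same words as the original, so any drop in retrieval performance can be attributed to the loss of word order.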

Qualitative analysis of the role of sequential structure suggests that the models are sensitive to the fact that periods terminate a sentence, that sentences tend not to start with conjunctions, that topics appear in sentence-initial position, and that words have different importance as modifiers versus heads. Figure 3 shows an example; see supplementary material for more.

IMAGINET vs captioning systems While it is not our goal to engineer a state-of-the-art image retrieval system, we want to situate IMAGINET’s performance within the landscape of image retrieval results on captioned images. As most of these are on Flickr30K (Young et al., 2014), we ran MULTITASK on it and got an accuracy@5 of 32%, within the range of numbers reported in previous work: 29.8% (Socher et al., 2014), 31.2% (Mao et al., 2014), 34% (Kiros et al., 2014) and 37.7% (Karpathy and Fei-Fei, 2014). Karpathy and Fei-Fei (2014) report 29.6% on MS-COCO, but with additional training data.


























Original    a couple of horses UNK their head over a rock pile
  rank 1    two brown horses hold their heads above a rocky wall.
  rank 2    two horses looking over a short stone wall.
Scrambled   rock couple their head pile a a UNK over of horses
  rank 1    an image of a man on a couple of horses
  rank 2    looking in to a straw lined pen of cows

Original    a cute baby playing with a cell phone
  rank 1    small baby smiling at camera and talking on phone .
  rank 2    a smiling baby holding a cell phone up to ear .
Scrambled   phone playing cute cell a with baby a
  rank 1    someone is using their phone to send a text or play a game .
  rank 2    a camera is placed next to a cellular phone .

Table 3: Examples of two nearest neighbors retrieved by MULTITASK for original and scrambled captions.


“a variety of kitchen utensils hanging from a UNK board”

“kitchen of from hanging UNK variety a board utensils a”

Figure 3: For the original caption, MULTITASK understands kitchen as a modifier of the headword utensils, which is the topic. For the scrambled sentence, the model thinks kitchen is the topic.


Paraphrase retrieval In our dataset each image is paired with five different captions, which can be seen as paraphrases. This affords us the opportunity to test IMAGINET’s sentence representations on a non-visual task. Although all models receive one caption-image pair at a time, the co-occurrence with the same image can lead the model to learn structural similarities between captions that are different on the surface. We feed the whole set of validation captions through the trained model and record the final hidden visual state $h^V_n$. For each caption we rank all others according to cosine similarity and measure the proportion of the ones associated with the same image among the top four highest ranked. For the scrambled condition, we rank original captions against a scrambled one. Figure 2 (right) summarizes the results: both models outperform the baseline on ordered captions, but not on scrambled ones. As expected, MULTITASK is more affected by manipulating word order, because it is more sensitive to structure. Table 3 shows concrete examples of the effect of scrambling words on which sentences are retrieved.
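The paraphrase retrieval measure can be sketched as below for the original (unscrambled) condition. The function names and data layout are hypothetical, and the score is computed under one reading of the description: for each caption, the fraction of its k nearest captions that belong to the same image, averaged over all captions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def paraphrase_score(states, image_ids, k=4):
    """states[i] is the sentence vector of caption i, image_ids[i] its image."""
    total = 0.0
    for q in range(len(states)):
        # Rank all other captions by similarity to caption q.
        others = [i for i in range(len(states)) if i != q]
        others.sort(key=lambda i: cosine(states[q], states[i]), reverse=True)
        same = sum(1 for i in others[:k] if image_ids[i] == image_ids[q])
        total += same / k
    return total / len(states)
```

With five captions per image, each caption has exactly four paraphrases, so with `k=4` the precision and recall of this ranking coincide.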

5 Discussion 

IMAGINET is a novel model of grounded language acquisition which simultaneously learns word meaning representations and knowledge of sentence structure from captioned images. It acquires meaning representations for individual words from descriptions of visual scenes, mimicking an important aspect of human language learning, and can effectively use sentence structure in semantic interpretation of multi-word phrases. In the future we plan to upgrade the current word-prediction pathway to a sentence reconstruction and/or sentence paraphrasing task in order to encourage the formation of representations of full sentences. We also want to explore the acquired structure further, especially for generalizing the grounded meanings to those words for which visual data is not available.
A Image retrieval with single words





Keyword       Original label          Hypernym
dessert       ice cream               dessert
parrot        macaw                   parrot
locomotive    steam locomotive        locomotive
bicycle       bicycle-built-for-two   bicycle
parachute     parachute
snowmobile    snowmobile

Figure 4: Sample images for single words. Under each image are the keyword used for the retrieval and the original label of the image; if the original label was not in our vocabulary, its hypernym is also given.


We visualize the acquired meaning of individual words using images from the ILSVRC2012 subset of 
ImageNet (Russakovsky et al., 2014). Labels of the images in ImageNet are synsets from WordNet, 
which identify a single concept in the image rather than providing descriptions of its full content. When 
the synset labels in ImageNet are too specific and cannot be found in our vocabulary, we replace them 
with their hypernyms from WordNet. 

Figure 4 shows examples of images retrieved via projections of single words into the visual space 
using the MULTITASK model. As can be seen, the predicted images are intuitive. For those for which 
we use the hypernym as key, the more general term (e.g. parrot) is much more common in humans’ daily 
descriptions of visual scenes than the original label used in ImageNet (e.g. macaw). The quantitative 
evaluation of this task is reported in the body of the paper. 

B Effect of scrambling word order 

In Figures 5-7 we show some illustrative cases of the effect for image retrieval of scrambling the input 
captions to the MULTITASK model trained on un-scrambled ones. These examples suggest that the model 
learns a number of facts about sentence structure. They range from very obvious, e.g. periods terminate 
sentences, to quite interesting, such as the distinction between modifiers and heads or the role of word 
order in encoding information structure (i.e. the distinction between topic and comment). 












a pigeon with red feet perched on a wall. 



feet on wall . pigeon a red with a perched 



Figure 5: In the scrambled sentence, the presence of a full stop in the middle of a sentence causes all 
material following it to be ignored, so the model finds pictures with wall-like objects. 

C Propagating distributional information through Multi-Task objective 

Table 4 lists example word pairs for which the MULTITASK model matches human judgments more closely than the VISUAL model. Some interesting cases are words which are closely related but which have the opposite meaning (dawn, dusk), or words which denote entities from the same broad class, but which are visually very dissimilar (insect, lizard). There are, however, also examples where there is no obvious prior expectation for the MULTITASK model to do better, e.g. (maple, oak).


Word 1         Word 2       Human   Multitask   Visual
construction   downtown     0.5     0.5         0.2
sexy           smile        0.4     0.4         0.2
dawn           dusk         0.8     0.7         0.4
insect         lizard       0.6     0.5         0.2
dawn           sunrise      0.9     0.7         0.4
collage        exhibition   0.6     0.4         0.2
bikini         swimsuit     0.9     0.7         0.4
outfit         skirt        0.7     0.5         0.2
sun            sunlight     1.0     0.7         0.4
maple          oak          0.9     0.5         0.2
shirt          skirt        0.9     0.4         0.1

Table 4: A sample of word pairs from the MEN 3K dataset for which the MULTITASK model matches human judgments better than VISUAL. All scores are scaled to the [0,1] range.



















blue and silver motorcycle parked on pavement under plastic awning . 






pavement silver awning and motorcycle blue on under plastic . parked 


Figure 6: The model understands that motorcycle is the topic, even though it’s not the very first word. In the scrambled sentence it treats pavement as the topic.


a brown teddy bear laying on top of a dry grass covered ground . 



a a of covered laying bear on brown grass top teddy ground . dry 



Figure 7: The model understands the compound teddy bear. In the scrambled sentence, it finds a picture 
of real bears instead. 

















